DETAILED ACTION
Information Disclosure Statement 

1. The Applicants' Information Disclosure Statement, filed 28 August 2002, has been received 
and entered into the record. Since the Information Disclosure Statement complies with the 
provisions of MPEP § 609, the references cited therein have been considered by the examiner. See 
attached form PTO 1449. 

The Invention 

2. The claimed invention is a system and method for cleaning a set of hypertext documents in 
order to minimize violations of a Hypertext Information Retrieval rule set. 

Drawings 

3. Receipt of corrected formal drawings, filed 28 February 2002, is acknowledged. These 
drawings are approved by the examiner. 

Claim Rejections - 35 USC § 112

4. The following is a quotation of the second paragraph of 35 U.S.C. 112:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject 
matter which the applicant regards as his invention. 

5. Claims 4-6 and 16-18 are rejected under 35 U.S.C. 112, second paragraph, as being indefinite
for failing to particularly point out and distinctly claim the subject matter which applicant regards as
the invention.

6. Regarding claims 4, 5, 16 and 17, the claims are rendered indefinite because the claims,
particularly in light of the specification at page 13, line 10 through page 14, line 4, recite a process
that is inconsistent.

At step 610 (Figure 6), the top node is removed from the queue, and is analyzed to 
determine if the node is a pagelet. However, it is disclosed that one of the criteria of pagelet 
determination is whether the node has any children that are pagelets. The inconsistency is that at
step 612, in order for a pagelet determination to be made regarding node v, all of the node's children
need to undergo the same analysis. However, those nodes are not inserted into the queue for such
analysis until after it has been determined that node v is not a pagelet, in step 614. Put another way,
the child nodes are not pushed onto the queue to determine if they are pagelets until after it is 
determined that their parent node is not a pagelet, but that determination cannot take place until the 
child nodes have already undergone this analysis. 
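
The ordering problem described above can be illustrated with a minimal sketch. The `Node` class, the example tree and the function names are hypothetical and are not taken from the specification; the sketch also simplifies step 614 by always queuing the children. It shows only that a top-down queue tests a parent before any of its children has been examined.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def queue_order(root):
    """Return node names in the order a FIFO queue would examine them."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()       # analogous to step 610: take the top node
        order.append(node.name)      # the pagelet test would run here (step 612)
        queue.extend(node.children)  # children are queued only afterwards (step 614)
    return order


def unexamined_children_at_test_time(root):
    """Map each node to the children not yet examined when the node is tested."""
    examined = set()
    pending = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        missing = [c.name for c in node.children if c.name not in examined]
        if missing:
            pending[node.name] = missing
        examined.add(node.name)
        queue.extend(node.children)
    return pending


# Hypothetical tree: v has children a and b; a has child c.
tree = Node("v", [Node("a", [Node("c")]), Node("b")])
print(queue_order(tree))
print(unexamined_children_at_test_time(tree))
```

In this sketch every node that has children is tested while all of those children are still unexamined, which is the circularity noted in the rejection.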

7. Claims 6 and 18, incorporating the deficiencies of their respective parent claims, are also 
rendered indefinite. 

Claim Rejections - 35 USC § 102 

8. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the
basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 

(b) the invention was patented or described in a printed publication in this or a foreign country or in public use or on sale in 
this country, more than one year prior to the date of application for patent in the United States. 



Application/Control Number 10/055,586 Page 4 

Art Unit: 2177 

9. Claims 1-3, 7, 9, 11, 13-15 and 19 are rejected under 35 U.S.C. 102(b) as being anticipated by
Broder et al. ("Syntactic Clustering of the Web").

10. Regarding claim 1, Broder et al. teaches a method as claimed, comprising the step of 
cleaning, by operations of a computer system, a set of text documents to minimize violations of a 
predetermined set of Hypertext Information Retrieval rules (see section 1 Introduction, beginning
on page 2; see also section 5.1 Common Shingles, beginning on page 7, particularly the disclosure on
page 8 that common shingles either have no effect on the overall resemblance of the documents or 
they have the effect of creating a false resemblance between two basically dissimilar documents, and 
so common shingles are ignored). 

11. Regarding claim 13, Broder et al. teaches a computer readable medium including computer 
instructions for driving a user interface as claimed, the computer instructions comprising 
instructions for cleaning, by operations of a computer system, a set of text documents to minimize 
violations of a predetermined set of Hypertext Information Retrieval rules (see section 1
Introduction, beginning on page 2; see also section 5.1 Common Shingles, beginning on page 7,
particularly the disclosure on page 8 that common shingles either have no effect on the overall 
resemblance of the documents or they have the effect of creating a false resemblance between two 
basically dissimilar documents, and so common shingles are ignored). 

12. Regarding claim 9, Broder et al. teaches a system as claimed, comprising: 

a) a user interface (see disclosure that the system can be used for filtering the results of Web
searches, said web searches requiring a user interface, Abstract);

b) a user interface/event manager communicatively coupled to the user interface (see
disclosure that the system can be used for filtering the results of Web searches, said
web searches requiring an event handler to respond to user requests, Abstract);

c) a generic data gathering device (see disclosure that the system can be applied to a group of
documents found by the AltaVista spider, section 1 Introduction, beginning on page
2);

d) a generic information retrieval application, communicatively coupled to the user
interface/event manager (see disclosure that the system can be applied to a group of
documents found by the AltaVista spider, section 1 Introduction, beginning on page
2); and

e) a data cleaning application for

i) decomposing each page of a set of text documents into one or more pagelets (see
disclosure that documents are decomposed into shingles, analogous to the
claimed pagelets, in section 2 Defining Similarity of Documents, beginning on
page 3);

ii) identifying all pagelets belonging to templates (see disclosure that common shingles
were nearly all mechanically generated, including shared header or footer
information on a large number of automatically generated pages, i.e. forms,
analogous to the claimed templates, in section 5.1 Common Shingles,
beginning on page 7); and

iii) eliminating the template pagelets from a data set (see section 5.1 Common
Shingles, beginning on page 7, particularly the disclosure on page 8 that
common shingles either have no effect on the overall resemblance of the
documents or they have the effect of creating a false resemblance between two
basically dissimilar documents, and so common shingles are ignored),
communicatively coupled to the generic data gathering application and to the generic
information retrieval application.



13. Regarding claim 11, Broder et al. teaches an apparatus as claimed, comprising:

a) a user interface (see disclosure that the system can be used for filtering the results of Web
searches, said web searches requiring a user interface, Abstract);

b) a user interface/event manager communicatively coupled to the user interface (see
disclosure that the system can be used for filtering the results of Web searches, said
web searches requiring an event handler to respond to user requests, Abstract);

c) a generic data gathering device (see disclosure that the system can be applied to a group of
documents found by the AltaVista spider, section 1 Introduction, beginning on page
2);

d) a generic information retrieval application, communicatively coupled to the user
interface/event manager (see disclosure that the system can be applied to a group of
documents found by the AltaVista spider, section 1 Introduction, beginning on page
2); and

e) a data cleaning application for

i) decomposing each page of a set of text documents into one or more pagelets (see
disclosure that documents are decomposed into shingles, analogous to the
claimed pagelets, in section 2 Defining Similarity of Documents, beginning on
page 3);

ii) identifying all pagelets belonging to templates (see disclosure that common shingles
were nearly all mechanically generated, including shared header or footer
information on a large number of automatically generated pages, i.e. forms,
analogous to the claimed templates, in section 5.1 Common Shingles,
beginning on page 7); and

iii) eliminating the template pagelets from a data set (see section 5.1 Common
Shingles, beginning on page 7, particularly the disclosure on page 8 that
common shingles either have no effect on the overall resemblance of the
documents or they have the effect of creating a false resemblance between two
basically dissimilar documents, and so common shingles are ignored),
communicatively coupled to the generic data gathering application and to the generic
information retrieval application.

14. Regarding claims 2 and 14, Broder et al. additionally teaches a method and computer 
readable medium as claimed, wherein the set of text documents comprises a collection of HTML 
pages (see disclosure in the Abstract that the disclosed invention is applied to every document on 
the World Wide Web, page 1). 

15. Regarding claims 3 and 15, Broder et al. additionally teaches a method and computer
readable medium as claimed, wherein the cleaning step comprises the steps of:

a) decomposing each page of the set of text documents into one or more pagelets (see
disclosure that documents are decomposed into shingles, analogous to the claimed
pagelets, in section 2 Defining Similarity of Documents, beginning on page 3);

b) identifying all pagelets belonging to templates (see disclosure that common shingles were
nearly all mechanically generated, including shared header or footer information on a
large number of automatically generated pages, i.e. forms, analogous to the claimed
templates, in section 5.1 Common Shingles, beginning on page 7); and

c) eliminating the template pagelets from a data set (see section 5.1 Common Shingles,
beginning on page 7, particularly the disclosure on page 8 that common shingles either
have no effect on the overall resemblance of the documents or they have the effect of
creating a false resemblance between two basically dissimilar documents, and so
common shingles are ignored).
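
The effect of ignoring common shingles can be sketched as follows. This is an illustrative sketch only: the shingle width `w`, the `max_docs` threshold, the helper names and the three sample documents are assumptions made for the example and are not taken from Broder et al. Resemblance is computed here as the size of the intersection of two shingle sets divided by the size of their union.

```python
from collections import Counter


def shingles(text, w=4):
    """Return the set of contiguous w-word shingles of a text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Size of the intersection over size of the union of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_common(shingle_sets, max_docs):
    """Remove shingles that occur in more than max_docs documents."""
    doc_freq = Counter(s for sset in shingle_sets for s in sset)
    return [{s for s in sset if doc_freq[s] <= max_docs} for sset in shingle_sets]


# Three hypothetical pages sharing a mechanically generated header.
docs = [
    "copyright 2002 all rights reserved alpha beta gamma delta",
    "copyright 2002 all rights reserved one two three four",
    "copyright 2002 all rights reserved red green blue white",
]
raw = [shingles(d) for d in docs]
cleaned = drop_common(raw, max_docs=2)

print(resemblance(raw[0], raw[1]))          # shared header inflates the score
print(resemblance(cleaned[0], cleaned[1]))  # score once common shingles are ignored
```

With the shared header shingles removed, the two otherwise dissimilar pages no longer show a false resemblance.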

16. Regarding claims 7 and 19, Broder et al. additionally teaches a method and computer 
readable medium as claimed, wherein the step of identifying all pagelets belonging to templates 
comprises the steps of: 

a) calculating a shingle value for each page and for each pagelet in the set of documents (see
disclosure of the calculation of fingerprint values on the shingles, in section 3
Estimating the Resemblance and the Containment, beginning on page 4);

b) eliminating identical pagelets belonging to duplicate pages (see section 5.1 Common
Shingles, beginning on page 7, particularly the disclosure on page 8 that common
shingles either have no effect on the overall resemblance of the documents or they
have the effect of creating a false resemblance between two basically dissimilar
documents, and so common shingles are ignored);

c) sorting the pagelets by their shingle value into clusters (see disclosure of the clustering
procedure in section 4.1 The Clustering Algorithm, on page 7);

d) enumerating the clusters (see disclosure of the clustering procedure in section 4.1 The
Clustering Algorithm, on page 7); and

e) outputting a representation corresponding to the pagelets belonging to each cluster (see
disclosure of the clustering procedure in section 4.1 The Clustering Algorithm, on
page 7).



Claim Rejections - 35 USC § 103

17. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all obviousness
rejections set forth in this Office action:

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 
102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the 
subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary 
skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the 
invention was made. 



18. The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966),
that are applied for establishing a background for determining obviousness under 35 U.S.C. 103(a)
are summarized as follows: 

1. Determining the scope and contents of the prior art. 

2. Ascertaining the differences between the prior art and the claims at issue. 

3. Resolving the level of ordinary skill in the pertinent art. 

4. Considering objective evidence present in the application indicating obviousness or
nonobviousness.

19. This application currently names joint inventors. In considering patentability of the claims
under 35 U.S.C. 103(a), the examiner presumes that the subject matter of the various claims was
commonly owned at the time any inventions covered therein were made absent any evidence to the
contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and
invention dates of each claim that was not commonly owned at the time a later invention was made
in order for the examiner to consider the applicability of 35 U.S.C. 103(c) and potential 35
U.S.C. 102(e), (f) or (g) prior art under 35 U.S.C. 103(a).

20. Claims 4 and 16 are rejected under 35 U.S.C. 103(a) as being unpatentable over Broder et al.
("Syntactic Clustering of the Web") as applied to claims 1-3, 7, 9, 11, 13-15 and 19 above, and
further in view of Chakrabarti et al. ("Enhanced Topic Distillation using Text, Markup Tags and
Hyperlinks").

21. Regarding claims 4 and 16, Broder et al. teaches a system and apparatus substantially as 
claimed. 

Broder et al. does not explicitly teach a system and apparatus wherein the decomposing step
comprises the claimed steps.

Chakrabarti et al., however, teaches a system and apparatus wherein the decomposing step
comprises the steps of:

a) parsing each text document into a parse tree that comprises at least one node (see
disclosure that each HTML page is a Document Object Model (DOM) tree, p. 210,
under section 3 Proposed Model and Algorithms);

b) traversing the at least one node of the tree (see disclosure that each HTML page is a
Document Object Model (DOM) tree, p. 210, under section 3 Proposed Model and
Algorithms; see also Figure 4, page 211, illustrating the finished tree wherein pagelets
have been pushed to the leaves of the tree; see also section 3.2 Segmentation and
Smoothing, page 211);

c) determining if one of the at least one node comprises a pagelet (see disclosure that each
HTML page is a Document Object Model (DOM) tree, p. 210, under section 3
Proposed Model and Algorithms; see also Figure 4, page 211, illustrating the finished
tree wherein pagelets have been pushed to the leaves of the tree; see also section 3.2
Segmentation and Smoothing, page 211); and

d) outputting a representation corresponding to the one of the at least one node if it
comprises a pagelet (see disclosure that each HTML page is a Document Object
Model (DOM) tree, p. 210, under section 3 Proposed Model and Algorithms; see also
Figure 4, page 211, illustrating the finished tree wherein pagelets have been pushed to
the leaves of the tree; see also section 3.2 Segmentation and Smoothing, page 211).

It would have been obvious to one of ordinary skill in the art at the time of the invention to 
decompose an HTML document to arrive at a list of pagelets through the use of a parse tree, since it 
is important to bring in additional sources of information (like a tag tree structure) where possible, 
to combat topic drift and clique attacks (see page 210, col. 1, last paragraph).

22. Claims 10 and 12 are rejected under 35 U.S.C. 103(a) as being unpatentable over Broder et
al. ("Syntactic Clustering of the Web") as applied to claims 1-3, 7, 9, 11, 13-15 and 19 above, and
further in view of Rodeheffer et al. (U.S. Patent 6,614,764).

23. Regarding claims 10 and 12, Broder et al. teaches a system and apparatus substantially as 
claimed, further comprising: 

a) a pagelet identifier, communicatively coupled to the data cleaning application (see
disclosure that documents are decomposed into shingles, analogous to the claimed
pagelets, in section 2 Defining Similarity of Documents, beginning on page 3);

b) a hypertext parser, communicatively coupled to the pagelet identifier (see disclosure in the
Abstract that the disclosed invention is applied to every document on the World Wide
Web, page 1; see also disclosure that documents are decomposed into shingles, in
section 2 Defining Similarity of Documents, beginning on page 3);

c) a template identifier, communicatively coupled to the data cleaning application (see
disclosure that common shingles were nearly all mechanically generated, including
shared header or footer information on a large number of automatically generated
pages, i.e. forms, analogous to the claimed templates, in section 5.1 Common Shingles,
beginning on page 7); and

d) a shingle calculator, communicatively coupled to the data cleaning application (see
disclosure of shingle construction, section 2 Defining Similarity of Documents,
beginning on page 3).

Broder et al. does not explicitly teach a system and apparatus comprising a Breadth First
Search (BFS) algorithm.

Rodeheffer et al., however, teaches the Breadth First Search (BFS) technique (see col. 34,
lines 43-62).

It would have been obvious to one of ordinary skill in the art at the time of the invention to 
incorporate a Breadth First Search (BFS) algorithm, since this would produce a spanning tree in 
which the path from each node to the root is as short as possible, and generally, shorter paths are 
better. Furthermore, the breadth-first search is also efficient (see col. 34, lines 51-62).
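
The shortest-path property relied upon above can be seen in a minimal sketch of a breadth-first spanning tree. The graph, the node names and the helper functions are hypothetical and are not drawn from Rodeheffer et al.; the graph is assumed to be unweighted and given as an adjacency mapping.

```python
from collections import deque


def bfs_spanning_tree(graph, root):
    """Return {node: parent} for a breadth-first spanning tree rooted at root."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in parent:   # first visit fixes the shortest path
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def depth(parent, node):
    """Count the hops from node back to the root along the tree."""
    hops = 0
    while parent[node] is not None:
        node = parent[node]
        hops += 1
    return hops


# Hypothetical undirected graph as an adjacency mapping.
graph = {
    "r": ["a", "b"],
    "a": ["r", "c"],
    "b": ["r", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}
tree = bfs_spanning_tree(graph, "r")
print(tree)
print(depth(tree, "d"))
```

Each node is reached for the first time along a path with the fewest possible edges, and every edge is inspected a bounded number of times, which is the efficiency point made in the cited passage.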

Allowable Subject Matter 

24. Claims 8 and 20 are objected to as being dependent upon a rejected base claim, but would be 
allowable if rewritten in independent form including all of the limitations of the base claim and any 
intervening claims. 

25. The following is a statement of reasons for the indication of allowable subject matter:

The present invention is directed to a system and method for cleaning a set of hypertext
documents in order to minimize violations of a Hypertext Information Retrieval rule set, including
the steps of decomposing each page in the set of documents into one or more pagelets, calculating a
shingle value for each page and pagelet in the document set, clustering the pagelets based upon their
shingle value, for each cluster with more than one pagelet finding all hyperlinks connecting pages
owning pagelets in the cluster and finding all undirected connected components of a graph induced
by the pages owning pagelets in the cluster, outputting any components with size greater than 1, and
finally eliminating these pagelets as being part of a template.
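
The steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the tuple layout of `pagelets`, the `links` list, the function name and the sample data are all hypothetical, and the shingle values are taken as already computed.

```python
from collections import defaultdict


def template_pagelets(pagelets, links):
    """
    pagelets: iterable of (page_id, pagelet_id, shingle_value) tuples
    links:    iterable of (source_page, target_page) hyperlinks
    Returns the set of (page_id, pagelet_id) pairs flagged as template pagelets.
    """
    links = list(links)

    # Cluster pagelets by shingle value.
    clusters = defaultdict(list)
    for page, pagelet, shingle in pagelets:
        clusters[shingle].append((page, pagelet))

    flagged = set()
    for members in clusters.values():
        if len(members) < 2:
            continue  # only clusters with more than one pagelet are analyzed
        pages = {page for page, _ in members}

        # Undirected graph induced by the pages owning pagelets in the cluster.
        adjacent = {p: set() for p in pages}
        for src, dst in links:
            if src in pages and dst in pages and src != dst:
                adjacent[src].add(dst)
                adjacent[dst].add(src)

        # Find connected components; keep those with size greater than 1.
        visited = set()
        for start in pages:
            if start in visited:
                continue
            component, stack = set(), [start]
            while stack:
                page = stack.pop()
                if page in component:
                    continue
                component.add(page)
                stack.extend(adjacent[page] - component)
            visited |= component
            if len(component) > 1:
                flagged |= {(pg, pl) for pg, pl in members if pg in component}
    return flagged


# Hypothetical data: three pages share a navigation pagelet, two are linked.
pagelets = [
    ("p1", "nav", 111), ("p2", "nav", 111), ("p3", "nav", 111),
    ("p1", "body", 5), ("p2", "body", 6),
]
links = [("p1", "p2")]
print(template_pagelets(pagelets, links))
```

In the sample data the navigation pagelet on the unlinked page `p3` is not flagged, because its page forms a component of size 1; this link-based analysis is the feature distinguished from Broder et al. in the reasons that follow.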

The closest prior art of record, Broder et al. ("Syntactic Clustering of the Web") teaches a 
system whereby the syntactic similarity of web pages is calculated, thus allowing the removal of
duplicate elements of web pages. The reference anticipates the claimed decomposition of web
pages, calculation of shingle values, clustering, and removal of duplicate or near-duplicate elements
and/or pages.

However, Broder et al. fails to anticipate or render obvious the recited feature of analyzing 
the clusters in order to identify components belonging to templates through the links between pages 
containing the pagelets of the cluster, as in dependent claims 8 and 20. 

These features are novel and non-obvious over the prior art of record.

Conclusion 

26. The prior art made of record and not relied upon is considered pertinent to applicant's 
disclosure. 

Broder et al. (U.S. Patent 5,909,677) teaches a method for facilitating the comparison of 
two computerized documents. 

Broder et al. (U.S. Patent 6,119,124) teaches a computer-implemented method of
determining the resemblance of data objects such as web pages.

Dean et al. (U.S. Patent 6,138,113) teaches a method for identifying pages that are near
duplicates in a linked database.

Broder et al. (U.S. Patent 6,230,155) teaches a method for facilitating the comparison of
two computerized documents.

Broder et al. (U.S. Patent 6,349,296) teaches a computer-implemented method of 
determining the resemblance of data objects such as web pages. 

Gomes et al. (U.S. Patent 6,615,209) teaches a duplicate detection technique that uses query 
relevant information to limit the portions of documents to be compared for similarity. 

Pugh et al. (U.S. Patent 6,658,423) teaches a duplicate and near-duplicate detection 
technique that assigns a number of fingerprints to a given document. 

Dean et al. (U.S. Patent 6,665,837) teaches a method for identifying related pages among a 
plurality of pages in a linked database such as the World Wide Web. 

Manber ("Finding Similar Files in a Large File System") teaches a tool, called sif, for
finding all similar files in a large file system.

Broder ("Some Applications of Rabin's Fingerprinting Method") teaches an implementation
and several applications of Rabin's fingerprinting scheme that take considerable advantage of its
algebraic properties.

Agrawal et al. ("Fast Algorithms for Mining Association Rules") teaches two new 
algorithms for solving the problem of discovering association rules between items in a large database 
of sales transactions. 

Brin et al. ("Copy Detection Mechanisms for Digital Documents") teaches a proposed
system for registering documents and then detecting either complete or partial copies.

Heintze ("Scalable Document Fingerprinting (Extended Abstract)") teaches an online
system that provides reliable search results using modest resources and scales up to data sets of the
order of a million documents.

Broder ("On the Resemblance and Containment of Documents") teaches the mathematical 
properties of resemblance and containment and the efficient implementation of the sampling 
process using Rabin fingerprints. 

Fang et al. ("Computing Iceberg Queries Efficiently") teaches an efficient algorithm to
evaluate iceberg queries using very little memory and fewer passes over data when compared to 
current techniques that use sorting or hashing. 

Kumar et al. ("Trawling the Web for Emerging Cyber-Communities") teaches the
systematic enumeration of emerging communities from a web crawl. 

W3C ("Document Object Model (DOM) Level 2 Core Specification, Version 1.0") is the 
specification for the Document Object Model. 

Davison ("Recognizing Nepotistic Links on the Web") teaches some of the issues 
surrounding the question of what links to consider and which to disregard when conducting link 
analysis in query results ranking. 

Crescenzi et al. ("RoadRunner: Towards Automatic Data Extraction from Large Web
Sites") teaches techniques for extracting data from HTML sites through the use of automatically 
generated wrappers. 

The following references, although not qualifying as prior art, are also of interest:

Bar-Yossef et al. ("Template Detection via Data Mining and its Applications") teaches a
practical solution for the template detection problem based on counting frequent item sets.

Haveliwala et al. ("Evaluating Strategies for Similarity Search on the Web") teaches a
technique for automatically evaluating strategies for answering Related Pages queries using Web
hierarchies, such as Open Directory, instead of user feedback.

Crescenzi et al. ("RoadRunner: Automatic Data Extraction from Data-Intensive Web
Sites") teaches a matching technique that automatically generates a common wrapper by exploiting 
similarities and differences among HTML pages sharing a similar structure. 

Laender et al. ("A Brief Survey of Web Data Extraction Tools") teaches a taxonomy for 
characterizing Web data extraction tools and provides a qualitative analysis and survey of major Web 
data extraction tools. 

Arasu et al. ("Extracting Structured Data from Web Pages") teaches an algorithm that takes 
as input a set of template-generated pages, deduces the unknown template used to generate the 
pages, and extracts as output the values encoded in the pages. 

Yi et al. ("Eliminating Noisy Information in Web Pages for Data Mining") teaches a noise 
elimination technique for pages on a web site, wherein a style tree is built to capture the common 
presentation styles among pages on the web site. 

Ma et al. ("Extracting Unstructured Data from Template Generated Web Documents") 
teaches a system that identifies web page templates and extracts the unstructured data.

Any inquiry concerning this communication or earlier communications from the examiner
should be directed to Luke S. Wassum whose telephone number is 703-305-5706. The examiner can 
normally be reached on Monday-Friday 8:30-5:30, alternate Fridays off. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, 
John E. Breene can be reached on 703-305-9790. The fax phone number for the organization 
where this application or proceeding is assigned is 703-872-9306. 

In addition, INFORMAL or DRAFT communications may be faxed directly to the examiner
at 703-746-5658. 

Customer Service for Tech Center 2100 can be reached during regular business hours at 
(703) 306-5631, or fax (703) 746-7240. 

Information regarding the status of an application may be obtained from the Patent
Application Information Retrieval (PAIR) system. Status information for published applications
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished
applications is available through Private PAIR only. For more information about the PAIR system,
see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system,
contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

Luke S. Wassum 
Art Unit 2177 
5 May 2004



